Abstract —In cyber-physical systems such as in-vehicle wireless
sensor networks, a large number of sensor nodes continually 
generate measurements that should be received by other nodes 
such as actuators in a regular fashion. Meanwhile, energy
efficiency is also important in wireless sensor networks. Motivated
by these requirements, we develop scheduling policies which are energy
efficient and simultaneously maintain “regular” deliveries of 
packets. A tradeoff parameter is introduced to balance these 
two conflicting objectives. We employ a Markov Decision Process 
(MDP) model where the state of each client is the time since the
last delivery of its packet, and reduce it to an equivalent
finite-state MDP problem. Although this equivalent problem can be
solved by standard dynamic programming techniques, it suffers
from high computational complexity. Thus we further pose the
problem as a restless multi-armed bandit problem and employ 
the low-complexity Whittle Index policy. It is shown that this 
problem is indexable and the Whittle indexes are derived. Also, 
we prove the Whittle Index policy is asymptotically optimal and 
validate its optimality via extensive simulations. 

I. Introduction 

Cyber-physical systems typically employ wireless sensors 
for keeping track of physical processes such as temperature 
and pressure. These nodes then transmit data packets containing
these measurements back to the access point/base station.
Moreover, these packets should be delivered in a “regular”
way. So, the time between successive deliveries of packets, i.e., the
inter-delivery time, is an important performance metric [1],
0. Furthermore, many wireless sensors are battery powered.
Thus, energy efficiency is also important.

We address the problem of satisfying these two conflicting
objectives: the inter-delivery time requirement and energy
efficiency. We design wireless scheduling policies that support
the inter-delivery requirements of such wireless clients in an
energy-efficient way. In 0, 0, the authors analyzed the
growth rate of service irregularities that occur when
multiple clients share a wireless network and the system
is in the heavy-traffic regime. The inter-delivery performance
of the Max Weight discipline under the heavy-traffic regime
was studied in 0. To the authors’ best knowledge, the
inter-delivery time was first considered in 0^0 as a performance
metric for queueing systems, where a sub-optimal policy is
proposed to trade off the stabilization of the queues against service
regularity. However, this is different from our problem, where
the arrival process does not need to be modeled. In our previous
work 0, throughput is traded off for better performance with
respect to variations in inter-delivery times. However, tunable
and heterogeneous inter-delivery requirements have not been
considered.

In this paper, we formulate the problem as a Markov 
Decision Process (MDP) with a system cost consisting of 
the summation of the penalty for exceeding the inter-delivery 
threshold and a weighted transmission energy consumption. 
An energy-efficiency weight parameter $\eta$ is introduced to
balance these two aspects. To solve this infinite-state MDP
problem, we reduce it to an equivalent MDP comprising
only a finite number of states. This equivalent finite-state
finite-action MDP can be solved using standard dynamic 
programming (DP) techniques. 

The significant challenge of this MDP approach is its
computational complexity, since the state space of the equivalent
MDP grows exponentially in the number of clients. To
address this, we further formulate this equivalent MDP as a 
restless multi-armed bandit problem (RMBP), with the goal 
of exploiting a low-complexity index policy. 

In this RMBP, we first derive an upper bound on the 
achievable system reward by exploring the structure of a 
relaxed-constraint problem. Then, we determine the Whittle 
index for our multi-armed restless bandit problem, and prove 
that the problem is indexable. In addition, we show the 
resulting index policy is optimal in certain cases, and validate 
the optimality by a detailed simulation study. The impact of the
energy-efficiency parameter $\eta$ is also studied in the simulation
results.

II. System Model 

Consider a cyber-physical system in which there are N 
wireless sensors and one access point (AP). We will assume 
that time is discrete. At most L sensors can simultaneously 
transmit in a time slot. In each time slot, the AP broadcasts a
control message at the beginning to announce which
set of at most L sensors may transmit in the current time slot. Each of
the assigned sensors then makes a sensor measurement and
transmits its packet. The length of a time slot is the time
required for the AP to send the control message plus the time
required for the L assigned clients to prepare and transmit a
packet.

The wireless channel connecting each sensor and the AP is
unreliable. When client n is selected to transmit, it succeeds
in delivering a packet with probability $p_n \in (0, 1)$. Furthermore,
each attempt to transmit a packet of client n consumes
$E_n$ units of energy.
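The channel model above can be mimicked in a few lines. The following is a minimal Python sketch, not the authors' simulator; the function name `attempt_transmissions` and the list-based arguments are hypothetical choices made for illustration only.

```python
import random


def attempt_transmissions(selected, p, E, rng):
    """Simulate one time slot of the unreliable channel.

    selected: indices of the clients chosen by the AP in this slot
    p[n]:     delivery success probability of client n
    E[n]:     energy consumed by one transmission attempt of client n
    Returns (delivered flags per client, total energy spent in the slot).
    """
    delivered = [False] * len(p)
    energy = 0.0
    for n in selected:
        energy += E[n]                      # every attempt costs E_n, successful or not
        delivered[n] = rng.random() < p[n]  # success with probability p_n
    return delivered, energy
```

For example, `attempt_transmissions([0, 2], [0.6, 0.8, 0.5], [2.0, 3.0, 1.0], random.Random(0))` charges the energy of clients 0 and 2 and can never report a delivery for client 1, which was not selected.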

The QoS requirement of client n is specified through an
integer, the packet inter-delivery time threshold $\tau_n$. The cost
incurred by the system during the time interval {0, 1, ..., T}
is given by

$$
\mathbb{E}\left[ \sum_{n=1}^{N} \left( \sum_{i=1}^{M_T^{(n)}} \left( D_i^{(n)} - \tau_n \right)^+ + \left( T - t^{(n)}_{M_T^{(n)}} - \tau_n \right)^+ + \eta \hat{M}_T^{(n)} E_n \right) \right], \tag{1}
$$

where $D_i^{(n)}$ is the time between the deliveries of the i-th and
(i+1)-th packets for client n, $M_T^{(n)}$ is the number of packets
delivered for the n-th client by time T, $t_i^{(n)}$ is the time
slot in which the i-th packet for client n is delivered, $\hat{M}_T^{(n)}$
is the total number of slots in {0, 1, ..., T-1} in which the
n-th client is selected to transmit, and $(a)^+ := \max\{a, 0\}$. The
second term is included since, otherwise, making no transmission at
all would result in the least cost. The last term weights the total
energy consumption in T time steps by a non-negative
energy-efficiency parameter $\eta$, which tunes the weight given to
energy conservation. The access point’s goal is to select at
most L clients to transmit in each time slot from among the
N clients, so as to minimize the above cost.
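To make the three terms of the cost concrete, here is a small Python sketch that evaluates the cost of one client along a given sample path. It is an illustration under stated assumptions: time 0 is treated as a delivery instant (the client starts "fresh"), inter-delivery times are measured between consecutive delivery instants, and the helper name `client_cost` is hypothetical.

```python
def client_cost(delivery_times, num_attempts, tau, E, eta, T):
    """Sample-path cost of a single client over the horizon {0, ..., T}.

    delivery_times: increasing delivery instants in (0, T]
    num_attempts:   number of slots in which the client was selected
    Assumption (not from the paper): time 0 counts as a delivery instant.
    """
    def pos(a):
        return max(a, 0)

    instants = [0] + list(delivery_times)
    gaps = [b - a for a, b in zip(instants, instants[1:])]
    regularity = sum(pos(d - tau) for d in gaps)  # penalty for late deliveries
    tail = pos(T - instants[-1] - tau)            # time since the last delivery at T
    energy = eta * num_attempts * E               # weighted transmission energy
    return regularity + tail + energy
```

With `tau=5`, `T=20` and no transmissions at all, the first and last terms vanish but the tail term equals 15, which illustrates why the second term is needed to keep "never transmit" from being the cheapest policy.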


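The T-horizon optimal cost-to-go from initial state $\mathbf{x}$ is
given by

$$
V_T(\mathbf{x}) := \min_{\pi} \mathbb{E}\left[ \sum_{t=0}^{T-1} \sum_{n=1}^{N} \Big( \eta E_n U_n(t) + \left( X_n(t) + 1 - \tau_n \right)^+ \mathbb{1}\{X_n(t+1) = 0\} \Big) \,\Big|\, \mathbf{X}(0) = \mathbf{x} \right],
$$

where $\mathbb{1}\{\cdot\}$ is the indicator function, $\mathbf{X}(T) := \mathbf{0}$ (which
recovers the second term in the cost (1)), and the
minimization is over the class of history-dependent policies.
The Dynamic Programming (DP) recursion (see 0) is

$$
V_T(\mathbf{x}) = \min_{\mathbf{u}:\, \sum_n u_n \le L} \left\{ \eta \sum_{n=1}^{N} E_n u_n + \sum_{\mathbf{y}} P^{\text{MDP-1}}_{\mathbf{x},\mathbf{y}}(\mathbf{u}) \left[ \sum_{n=1}^{N} \left( x_n + 1 - \tau_n \right)^+ \mathbb{1}\{y_n = 0\} + V_{T-1}(\mathbf{y}) \right] \right\}. \tag{2}
$$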


III. Reduction to Finite State Problem 


In the following, vectors will be denoted in bold font, e.g.,
$\mathbf{a} := (a_1, \ldots, a_N)$. Define $\mathbf{a} \wedge \mathbf{b} := (a_1 \wedge b_1, \ldots, a_N \wedge b_N)$,
where $\wedge$ denotes the minimum. Random processes will be denoted by capitals.

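We formulate our system as a Markov Decision Process,
as follows. The system state at time slot t is denoted by a
vector $\mathbf{X}(t) := (X_1(t), \ldots, X_N(t))$, where $X_n(t)$ is the
time elapsed since the latest delivery of client n’s packet.
Denote the action at time t as $\mathbf{U}(t) := (U_1(t), \ldots, U_N(t))$,
with $\sum_{n=1}^{N} U_n(t) \le L$ for each t, where

$$
U_n(t) = \begin{cases} 1 & \text{if client } n \text{ is selected to transmit in slot } t, \\ 0 & \text{otherwise.} \end{cases}
$$

The system state evolves as

$$
X_n(t+1) = \begin{cases} 0 & \text{if a packet of client } n \text{ is delivered in slot } t, \\ X_n(t) + 1 & \text{otherwise.} \end{cases}
$$

Thus, the system forms a controlled Markov chain (denoted
MDP-1), with the transition probabilities given by

$$
P^{\text{MDP-1}}_{\mathbf{x},\mathbf{y}}(\mathbf{u}) := P\left[ \mathbf{X}(t+1) = \mathbf{y} \mid \mathbf{X}(t) = \mathbf{x}, \mathbf{U}(t) = \mathbf{u} \right] = \prod_{n=1}^{N} P\left[ X_n(t+1) = y_n \mid X_n(t) = x_n, U_n(t) = u_n \right],
$$

with

$$
P\left[ X_n(t+1) = y_n \mid X_n(t) = x_n, U_n(t) = u_n \right] = \begin{cases} p_n & \text{if } y_n = 0 \text{ and } u_n = 1, \\ 1 - p_n & \text{if } y_n = x_n + 1 \text{ and } u_n = 1, \\ 1 & \text{if } y_n = x_n + 1 \text{ and } u_n = 0, \\ 0 & \text{otherwise.} \end{cases}
$$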


The above problem, denoted as MDP-1, involves a countably
infinite state space. The following results show that it can
be replaced by an equivalent finite-state MDP.

Lemma 1. For the MDP-1, we have, $\forall x_1, \ldots, x_N \ge 0$,

$$
V_T(x_1, \ldots, \tau_i + x_i, \ldots, x_N) = x_i + V_T(x_1, \ldots, \tau_i, \ldots, x_N).
$$

Moreover, the optimal actions for the states $(x_1, \ldots, \tau_i + x_i, \ldots, x_N)$
and $(x_1, \ldots, \tau_i, \ldots, x_N)$ are the same.

Proof: Let us consider the MDP-1 starting from two
different initial states, $\mathbf{x} = (x_1, \ldots, \tau_i + x_i, \ldots, x_N)$ and
$\tilde{\mathbf{x}} = (x_1, \ldots, \tau_i, \ldots, x_N)$, and compare their evolutions.
Construct the processes associated with both systems on
a common probability space and stochastically couple the
successful transmissions of the two systems. Let $\pi$ be an
arbitrary history-dependent policy that is applied in the first
system (starting in state $\mathbf{x}$). Corresponding to $\pi$, there is a
policy $\tilde{\pi}$ in the second system, which takes the same actions
as the policy $\pi$ at each time slot. Then all the packet
inter-delivery times for both processes are the same, except for
the first inter-delivery time of the i-th client, which is larger for
the former system than for the latter by $x_i$. In addition,
since both systems take the same actions, their energy costs coincide.
Since the policy $\pi$ is arbitrary, $V_T(\mathbf{x}) \ge x_i + V_T(\tilde{\mathbf{x}})$. The
inequality in the other direction is proved similarly. The proof
of the second statement follows by letting $\pi$ be the optimal
policy. ■
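Lemma 1 can be checked numerically in the simplest setting. The sketch below assumes a single client (N = L = 1) and implements the convention X(T) := 0 by charging the pending penalty in the last slot; the helper name `make_value_function` and this particular encoding of the terminal condition are assumptions made for illustration, not the authors' code.

```python
from functools import lru_cache


def make_value_function(p, tau, E, eta):
    """Return V(T, x): the T-horizon cost-to-go of a single-client MDP-1."""

    @lru_cache(maxsize=None)
    def V(T, x):
        penalty = max(x + 1 - tau, 0)  # (x + 1 - tau)^+, paid when the state resets
        if T == 1:
            # Last slot: X(T) := 0 forces the reset, so the penalty is paid
            # whatever the action; staying idle avoids the energy cost.
            return float(penalty)
        idle = V(T - 1, x + 1)
        transmit = (eta * E
                    + p * (penalty + V(T - 1, 0))
                    + (1 - p) * V(T - 1, x + 1))
        return min(idle, transmit)

    return V


V = make_value_function(p=0.6, tau=5, E=2.0, eta=0.1)
gaps = [V(8, 5 + x) - V(8, 5) for x in range(4)]  # Lemma 1 predicts 0, 1, 2, 3
```

The differences `V(T, tau + x) - V(T, tau)` should equal `x` up to floating-point rounding, which is the content of the lemma in this one-client case.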










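Corollary 2. For any system state $\mathbf{x}$ such that $x_n \le \tau_n$, $\forall n$,

$$
V_T(\mathbf{x}) = \min_{\mathbf{u}:\, \sum_n u_n \le L} \left\{ \sum_{n=1}^{N} \left( \eta E_n u_n + \mathbb{1}\{x_n = \tau_n\} \right) + \sum_{\mathbf{y}} P^{\text{MDP-1}}_{\mathbf{x},\mathbf{y}}(\mathbf{u})\, V_{T-1}(\mathbf{y} \wedge \boldsymbol{\tau}) \right\}.
$$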


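Proof: Consider the equation (2) and the following two
cases:

1) The initial state $\mathbf{x}$ is such that $x_n < \tau_n$, $\forall n$. Then
$(x_n + 1 - \tau_n)^+ = 0$ and $\mathbb{1}\{x_n = \tau_n\} = 0$. In addition, for any
action $\mathbf{u}$, if $\mathbf{y}$ is any state such that $P^{\text{MDP-1}}_{\mathbf{x},\mathbf{y}}(\mathbf{u}) > 0$, then
$\mathbf{y}$ satisfies $y_n \le \tau_n$, $\forall n$, which shows $\mathbf{y} = \mathbf{y} \wedge \boldsymbol{\tau}$.

2) There exists an i such that the initial state $\mathbf{x}$ satisfies
$x_i = \tau_i$. Let us first assume there is only one client i satisfying
$x_i = \tau_i$ and that $x_j < \tau_j$, $\forall j \ne i$. Then, for any action
$\mathbf{u}$, if $\mathbf{y}$ is any state such that $P^{\text{MDP-1}}_{\mathbf{x},\mathbf{y}}(\mathbf{u}) > 0$, we have
$y_j \le \tau_j$, $\forall j \ne i$, and also $y_i$ is either 0 or $\tau_i + 1$. If $y_i = 0$
and $y_j \le \tau_j$, $\forall j \ne i$, then $(x_i + 1 - \tau_i)^+ \mathbb{1}\{y_i = 0\} = 1$
and $\mathbf{y} = \mathbf{y} \wedge \boldsymbol{\tau}$. If $y_i = \tau_i + 1$ and $y_j \le \tau_j$, $\forall j \ne i$,
then from Lemma 1, $V_{T-1}(\mathbf{y}) = 1 + V_{T-1}(\mathbf{y} \wedge \boldsymbol{\tau})$. Thus,
when there is only one client i satisfying $x_i = \tau_i$, the
r.h.s. (right-hand side) of (2) can be rewritten as

$$
\min_{\mathbf{u}:\, \sum_n u_n \le L} \left\{ \eta \sum_{n=1}^{N} E_n u_n + 1 + \sum_{\mathbf{y}} P^{\text{MDP-1}}_{\mathbf{x},\mathbf{y}}(\mathbf{u})\, V_{T-1}(\mathbf{y} \wedge \boldsymbol{\tau}) \right\}.
$$

The case where there are one or more clients $j \ne i$
satisfying $x_j = \tau_j$ is proved similarly.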


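The following lemma can be easily derived; its proof
is omitted due to space constraints.

Lemma 3. $\mathbf{Y}(t) := \mathbf{X}(t) \wedge \boldsymbol{\tau}$ is a Markov Decision Process
with

$$
P\left[ \mathbf{Y}(t+1) \mid \mathbf{Y}(t), \ldots, \mathbf{Y}(0), \mathbf{U}(t), \ldots, \mathbf{U}(0) \right] = P\left[ \mathbf{Y}(t+1) \mid \mathbf{Y}(t), \mathbf{U}(t) \right].
$$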


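Now we construct another MDP, denoted MDP-2, which
is equivalent to the MDP-1 in an appropriate sense. We will
slightly abuse notation and continue to use the symbols $\mathbf{Y}(t)$
and $\mathbf{U}(t)$ for states and controls.

For $Y_n(0) \in \{0, 1, \ldots, \tau_n\}$, let $Y_n(t)$ evolve as

$$
Y_n(t+1) = \begin{cases} 0 & \text{if a packet is delivered for client } n \text{ at } t, \\ (Y_n(t) + 1) \wedge \tau_n & \text{otherwise.} \end{cases}
$$

Denote by $P^{\text{MDP-2}}_{\mathbf{x},\mathbf{y}}(\mathbf{u})$ the transition probabilities of the resulting
process $\mathbf{Y}(t) := (Y_1(t), \ldots, Y_N(t))$ on the state space
$\mathbb{Y} := \prod_{n=1}^{N} \{0, 1, \ldots, \tau_n\}$, where the per-client transition probabilities are

$$
P\left[ Y_n(t+1) = y_n \mid Y_n(t) = x_n, U_n(t) = u_n \right] = \begin{cases} p_n & \text{if } y_n = 0 \text{ and } u_n = 1, \\ 1 - p_n & \text{if } y_n = (x_n + 1) \wedge \tau_n \text{ and } u_n = 1, \\ 1 & \text{if } y_n = (x_n + 1) \wedge \tau_n \text{ and } u_n = 0, \\ 0 & \text{otherwise.} \end{cases}
$$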
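The per-client transition law of MDP-2 is small enough to write out as an explicit matrix. The following Python sketch builds it for one client and one action; the function name `transition_matrix` and the nested-list representation are illustrative assumptions.

```python
def transition_matrix(p, tau, u):
    """Row-stochastic matrix P[x][y] on states 0..tau for action u in {0, 1}.

    p:   delivery success probability of the client
    tau: inter-delivery threshold (the state is capped at tau)
    """
    size = tau + 1
    P = [[0.0] * size for _ in range(size)]
    for x in range(size):
        nxt = min(x + 1, tau)       # (x + 1) ∧ tau: the state saturates at tau
        if u == 1:
            P[x][0] += p            # successful delivery resets the state
            P[x][nxt] += 1.0 - p    # failed attempt: the state ages by one slot
        else:
            P[x][nxt] += 1.0        # passive: the state ages deterministically
    return P
```

Because the state is truncated at `tau`, each client contributes only `tau + 1` states, which is what makes the product state space of MDP-2 finite.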



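The optimal cost-to-go function for MDP-2 is

$$
V_T^{\text{MDP-2}}(\mathbf{x}) := \min_{\pi} \mathbb{E}\left[ \sum_{t=0}^{T-1} \sum_{n=1}^{N} \Big( \mathbb{1}\{Y_n(t) = \tau_n\} + \eta E_n U_n(t) \Big) \,\Big|\, \mathbf{Y}(0) = \mathbf{x} \right], \quad \forall \mathbf{x} \in \mathbb{Y}. \tag{5}
$$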


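Theorem 4. MDP-2 is equivalent to the MDP-1 in that:

1) MDP-2 has the same transition probabilities as the
accompanying process of MDP-1, i.e., the process $\mathbf{X}(t) \wedge \boldsymbol{\tau}$;

2) Both MDPs satisfy the recursive relationship of Corollary 2; thus,
their optimal cost-to-go functions are equal for each
starting state $\mathbf{x}$ with $x_n \le \tau_n$, $\forall n$;

3) Any optimal control for MDP-1 in state $\mathbf{x}$ is also optimal
for MDP-2 in state $\mathbf{x} \wedge \boldsymbol{\tau}$.

Proof: Statement 1) directly follows from Lemma 3. The DP
recursion for the optimal cost in MDP-2, written $V_T^{\text{MDP-2}}$, is

$$
V_T^{\text{MDP-2}}(\mathbf{x}) = \min_{\mathbf{u}:\, \sum_n u_n \le L} \left\{ \sum_{n=1}^{N} \left( \eta E_n u_n + \mathbb{1}\{x_n = \tau_n\} \right) + \sum_{\mathbf{y}} P^{\text{MDP-2}}_{\mathbf{x},\mathbf{y}}(\mathbf{u})\, V_{T-1}^{\text{MDP-2}}(\mathbf{y}) \right\}. \tag{6}
$$

Thus, statement 2) is obtained from (6) and Corollary 2. In
addition, statement 3) follows from Lemma 1 and statement 1). ■

As a result, we focus on MDP-2 in the sequel.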
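For very small instances, the recursion (6) can be evaluated by brute force. The sketch below enumerates all joint states, feasible actions and delivery outcomes; it assumes the terminal condition V_0 = 0, and the function name `solve_mdp2` is hypothetical. It is meant to illustrate the recursion rather than to be an efficient solver.

```python
from itertools import product


def solve_mdp2(p, tau, E, eta, L, T):
    """Finite-horizon DP for MDP-2 by exhaustive enumeration (small N only)."""
    N = len(p)
    states = list(product(*[range(t + 1) for t in tau]))
    actions = [u for u in product((0, 1), repeat=N) if sum(u) <= L]
    V = {x: 0.0 for x in states}  # assumed terminal condition V_0 = 0
    policy = {}
    for _ in range(T):
        new_V = {}
        for x in states:
            best_cost, best_u = None, None
            for u in actions:
                # stage cost: eta * E_n * u_n + 1{x_n = tau_n}, summed over clients
                cost = sum(eta * E[n] * u[n] + (x[n] == tau[n]) for n in range(N))
                # expectation over the delivery outcomes d of all clients
                for d in product((0, 1), repeat=N):
                    prob = 1.0
                    for n in range(N):
                        if u[n]:
                            prob *= p[n] if d[n] else 1.0 - p[n]
                        elif d[n]:
                            prob = 0.0  # unselected clients cannot deliver
                    if prob > 0.0:
                        y = tuple(0 if d[n] else min(x[n] + 1, tau[n])
                                  for n in range(N))
                        cost += prob * V[y]
                if best_cost is None or cost < best_cost:
                    best_cost, best_u = cost, u
            new_V[x], policy[x] = best_cost, best_u
        V = new_V
    return V, policy
```

The three nested enumerations (states, actions, outcomes) each grow exponentially in N, so this approach is only practical for a handful of clients.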


IV. Optimal Index Policy for the Relaxed Problem 


A. Formulation of Restless Multi-armed Bandit Problem 

MDP-2, with a finite state space, can be solved in a
finite number of steps by standard DP techniques (see 0).
However, even for a finite time horizon, it suffers from high
computational complexity, since the cardinality of the state
space grows exponentially in the number N of clients.
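The growth of the state space is easy to quantify: client n contributes $\tau_n + 1$ states, so the joint state space has $\prod_n (\tau_n + 1)$ elements. A one-function Python sketch (the name `mdp2_state_count` is illustrative):

```python
from math import prod


def mdp2_state_count(tau):
    """Number of joint states of MDP-2 for the thresholds tau = [tau_1, ..., tau_N]."""
    return prod(t + 1 for t in tau)
```

For instance, five clients with threshold 10 and five clients with threshold 5 already give `mdp2_state_count([10] * 5 + [5] * 5) == 66 ** 5`, i.e., more than a billion joint states.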

To overcome this, we formulate MDP-2 as an infinite-horizon
restless multi-armed bandit problem (0, 0), and
obtain an index policy which has low complexity.

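We begin with some notation. Denote by $\alpha$ the maximum
fraction of clients that can simultaneously transmit in a time
slot, i.e., $\alpha = L/N$. The process $Y_n(t)$ associated with client
n is referred to as project n, in conformity with the bandit
nomenclature. If $U_n(t) = 1$, project n is said to be active
in slot t; while if $U_n(t) = 0$, it is said to be passive in slot t.

The infinite-horizon problem is to solve, with $\mathbf{Y}(0) = \mathbf{x} \in \mathbb{Y}$,

$$
\max_{\pi} \; \liminf_{T \to +\infty} \frac{1}{T} \mathbb{E}\left[ \sum_{t=0}^{T-1} \sum_{n=1}^{N} \Big( -\mathbb{1}\{Y_n(t) = \tau_n\} - \eta E_n U_n(t) \Big) \,\Big|\, \mathbf{Y}(0) = \mathbf{x} \right] \tag{7}
$$

$$
\text{s.t.} \quad \sum_{n=1}^{N} \left( 1 - U_n(t) \right) \ge (1 - \alpha) N, \quad \forall t. \tag{8}
$$

Note that the system reward is considered instead of the system
cost.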



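B. Relaxations

We consider an associated relaxation of the problem (7)-(8),
which puts a constraint only on the time-average number of
active projects allowed:

$$
\max_{\pi} \; \liminf_{T \to +\infty} \frac{1}{T} \mathbb{E}\left[ \sum_{t=0}^{T-1} \sum_{n=1}^{N} \Big( -\mathbb{1}\{Y_n(t) = \tau_n\} - \eta E_n U_n(t) \Big) \,\Big|\, \mathbf{Y}(0) = \mathbf{x} \right] \tag{9}
$$

$$
\text{s.t.} \quad \liminf_{T \to +\infty} \frac{1}{T} \mathbb{E}\left[ \sum_{t=0}^{T-1} \sum_{n=1}^{N} \left( 1 - U_n(t) \right) \,\Big|\, \mathbf{Y}(0) = \mathbf{x} \right] \ge (1 - \alpha) N. \tag{10}
$$

Since constraint (10) relaxes the stringent requirement in (8),
it provides an upper bound on the achievable reward in the
original problem.

Let us consider the Lagrangian associated with the problem
(9)-(10), with $\mathbf{Y}(0) = \mathbf{x} \in \mathbb{Y}$,

$$
l(\pi, \omega) := \liminf_{T \to +\infty} \frac{1}{T} \mathbb{E}\left[ \sum_{t=0}^{T-1} \sum_{n=1}^{N} \Big( -\mathbb{1}\{Y_n(t) = \tau_n\} - \eta E_n U_n(t) \Big) \,\Big|\, \mathbf{Y}(0) = \mathbf{x} \right] + \omega \liminf_{T \to +\infty} \frac{1}{T} \mathbb{E}\left[ \sum_{t=0}^{T-1} \sum_{n=1}^{N} \left( 1 - U_n(t) \right) \,\Big|\, \mathbf{Y}(0) = \mathbf{x} \right] - \omega (1 - \alpha) N,
$$

where $\pi$ is any history-dependent scheduling policy, while $\omega \ge 0$
is the Lagrange multiplier. The Lagrangian dual function
is $d(\omega) := \max_{\pi} l(\pi, \omega)$: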

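$$
\begin{aligned}
d(\omega) &\le \max_{\pi} \liminf_{T \to +\infty} \frac{1}{T} \mathbb{E}\left[ \sum_{t=0}^{T-1} \sum_{n=1}^{N} \Big( -\mathbb{1}\{Y_n(t) = \tau_n\} - \eta E_n U_n(t) + \omega \left( 1 - U_n(t) \right) \Big) \,\Big|\, \mathbf{Y}(0) = \mathbf{x} \right] - \omega (1 - \alpha) N \\
&\le \max_{\pi} \limsup_{T \to +\infty} \frac{1}{T} \mathbb{E}\left[ \sum_{t=0}^{T-1} \sum_{n=1}^{N} \Big( -\mathbb{1}\{Y_n(t) = \tau_n\} - \eta E_n U_n(t) + \omega \left( 1 - U_n(t) \right) \Big) \,\Big|\, \mathbf{Y}(0) = \mathbf{x} \right] - \omega (1 - \alpha) N \\
&\le \max_{\pi} \sum_{n=1}^{N} \limsup_{T \to +\infty} \frac{1}{T} \mathbb{E}\left[ \sum_{t=0}^{T-1} \Big( -\mathbb{1}\{Y_n(t) = \tau_n\} - \eta E_n U_n(t) + \omega \left( 1 - U_n(t) \right) \Big) \,\Big|\, \mathbf{Y}(0) = \mathbf{x} \right] - \omega (1 - \alpha) N,
\end{aligned} \tag{11}
$$

where the first and the third inequalities hold because of the
super-additivity of the limit inferior and the sub-additivity of the
limit superior, respectively.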

Now, consider the unconstrained problem in the last two
lines of (11). It can be viewed as a composition of N independent
$\omega$-subsidy problems, interpreted as follows: for each client
n, besides the original reward $-\mathbb{1}\{Y_n(t) = \tau_n\} - \eta E_n U_n(t)$,
when $U_n(t) = 0$, it receives a subsidy $\omega$ for being passive.

Thus, the $\omega$-subsidy problem associated with client n is
defined as

$$
R_n(\omega) = \max_{\pi_n} \limsup_{T \to +\infty} \frac{1}{T} \mathbb{E}\left[ \sum_{t=0}^{T-1} \Big( -\mathbb{1}\{Y_n(t) = \tau_n\} - \eta E_n U_n(t) + \omega \left( 1 - U_n(t) \right) \Big) \,\Big|\, Y_n(0) = x_n \right], \tag{12}
$$

where $\pi_n$ is a history-dependent policy which decides the
action $U_n(t)$ for client n in each time slot.

In the following, we first solve this $\omega$-subsidy problem, and
then explore its properties to show that strong duality holds
for the relaxed problem (9)-(10), and thereby determine the
optimal value for the relaxed problem.

For $\theta \in \{0, 1, \ldots, \tau_n\}$ and $\rho \in [0, 1]$, we define $\sigma_n(\theta, \rho)$
to be a threshold policy for project n, as follows: The policy
$\sigma_n(\theta, \rho)$ keeps the project passive at time t if $Y_n(t) < \theta$.
However, when $Y_n(t) > \theta$, the project is activated, i.e.,
$U_n(t) = 1$. If $Y_n(t) = \theta$, then at time t, the project stays
passive with probability $\rho$, and is activated with probability
$1 - \rho$.
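The threshold policy just defined translates directly into code. This is a minimal Python sketch; the function name `threshold_action` and the optional `rng` argument are illustrative assumptions.

```python
import random


def threshold_action(y, theta, rho, rng=random):
    """Action of the threshold policy sigma(theta, rho) in state y.

    Returns 0 (passive) below the threshold, 1 (active) above it, and
    randomizes at the threshold: passive with probability rho.
    """
    if y < theta:
        return 0
    if y > theta:
        return 1
    return 0 if rng.random() < rho else 1
```

With this convention, `rho = 0` activates the project exactly at the threshold, and `rho = 1` keeps it passive there, so `threshold_action(y, theta, 1.0)` and `threshold_action(y, theta + 1, 0.0)` always agree.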

For each project n, associate a function

$$
W_n(\theta) := p_n (\theta + 1) (1 - p_n)^{\tau_n - \theta - 1} - \eta E_n, \tag{13}
$$

where $\theta = 0, 1, \ldots, \tau_n - 1$. (We elaborate on the physical
meaning of $W_n(\cdot)$ later in Section V.)
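As a quick illustration, the function $W_n(\cdot)$ can be tabulated in Python. The sketch assumes the form $W_n(\theta) = p_n(\theta+1)(1-p_n)^{\tau_n-\theta-1} - \eta E_n$, which is the value of the subsidy at which the thresholds $\theta$ and $\theta + 1$ earn the same average reward; the function name `whittle_index` is a hypothetical choice.

```python
def whittle_index(theta, p, tau, E, eta):
    """W(theta) for theta in {0, ..., tau - 1}, under the assumed closed form."""
    if not 0 <= theta <= tau - 1:
        raise ValueError("theta must lie in {0, ..., tau - 1}")
    return p * (theta + 1) * (1 - p) ** (tau - theta - 1) - eta * E


# Tabulate the function for one parameter set (values used for illustration).
indices = [whittle_index(theta, p=0.6, tau=10, E=2.0, eta=0.1) for theta in range(10)]
```

Under this form the values increase with `theta`: a project that has waited longer needs a larger subsidy before staying passive becomes attractive.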

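Lemma 5. Consider the $\omega$-subsidy problem (12) for project
n. Then,

1) $\sigma_n(0, 0)$ is optimal iff the subsidy $\omega \le W_n(0)$.

2) For $\theta \in \{1, \ldots, \tau_n - 1\}$, $\sigma_n(\theta, 0)$ is optimal iff the
subsidy $\omega$ satisfies $W_n(\theta - 1) \le \omega \le W_n(\theta)$.

3) $\sigma_n(\tau_n, 0)$ is optimal iff $\omega = W_n(\tau_n - 1)$.

4) $\sigma_n(\tau_n, 1)$ is optimal iff $\omega \ge W_n(\tau_n - 1)$.

In addition, for $\theta \in \{0, 1, \ldots, \tau_n\}$, the policies
$\{\sigma_n(\theta, \rho) : \rho \in [0, 1]\}$ are optimal when

i) $0 \le \theta \le \tau_n - 1$ and $\omega = W_n(\theta)$,

ii) $\theta = \tau_n$ and $\omega = W_n(\tau_n - 1)$.

Furthermore, for any $\theta \in \{0, \ldots, \tau_n\}$, under the $\sigma_n(\theta, 0)$ policy,
the average reward earned is

$$
\frac{p_n \theta \omega - \eta E_n - (1 - p_n)^{\tau_n - \theta}}{1 + \theta p_n}. \tag{14}
$$

Meanwhile, under the $\sigma_n(\tau_n, 1)$ policy, the reward is $\omega - 1$.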

Proof: For the $\omega$-subsidy problem of project n, let us
first analyze the $\sigma_n(\theta, 0)$ policy. The subscript n is suppressed
in the following. For each $\theta \in \{0, 1, \ldots, \tau\}$, $\sigma(\theta, 0)$ is a
deterministic stationary policy. That is, for each $\sigma(\theta, 0)$, there
exists a function $g(\cdot)$ defined on the state space $\{0, 1, \ldots, \tau\}$
of the project, such that $U(t) = g(Y(t))$. Further, there exist
a real number R and a real function f on the state space with
$f(0) = 0$ such that

$$
\begin{aligned}
R + f(i) = {} & -\mathbb{1}\{i = \tau\} - g(i) E \eta + \omega \left( 1 - g(i) \right) \\
& + p\, g(i) f(0) + (1 - p)\, g(i) f\big( (i + 1) \wedge \tau \big) \\
& + \left( 1 - g(i) \right) f\big( (i + 1) \wedge \tau \big), \quad \forall i = 0, 1, \ldots, \tau.
\end{aligned}
$$

The values of R and $f(i)$, $i = 1, \ldots, \tau$, can be obtained by
solving the $\tau + 1$ equations above, and it can be shown that
R is the average expected system reward under this $\sigma(\theta, 0)$
policy (see Q). Then, by standard results in infinite-horizon
dynamic programming, see |j^, policy $\sigma(\theta, 0)$ is optimal if
and only if the following optimality equation is satisfied,

$$
\begin{aligned}
R + f(i) = \max_{u \in \{0, 1\}} \Big\{ & -\mathbb{1}\{i = \tau\} - u E \eta + \omega (1 - u) \\
& + p\, u f(0) + (1 - p)\, u f\big( (i + 1) \wedge \tau \big) \\
& + (1 - u) f\big( (i + 1) \wedge \tau \big) \Big\}, \quad \forall i = 0, \ldots, \tau.
\end{aligned} \tag{15}
$$

Similar results hold for the policy $\sigma(\tau, 1)$, under which the
system is always passive. The conditions in 1)-4), and the
average expected system reward under these policies, are thus
obtained.

To obtain the conditions i) and ii), note that $\sigma(\theta, 1) = \sigma(\theta + 1, 0)$,
and the policy $\sigma(\theta, \rho)$, $\rho \in (0, 1)$, can be regarded as a
combination of $\sigma(\theta, 0)$ and $\sigma(\theta, 1)$. ■
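The optimality equation can also be solved numerically, which gives an independent check of the threshold structure. The sketch below runs relative value iteration for a single project with the normalisation f(0) = 0 and compares the resulting gain with the closed-form average reward of a threshold policy. The function names, the iteration count and the test parameters are assumptions made for illustration.

```python
def closed_form_reward(theta, omega, p, tau, E, eta):
    """Average reward of sigma(theta, 0), using the closed form of Lemma 5."""
    return (p * theta * omega - eta * E - (1 - p) ** (tau - theta)) / (1 + theta * p)


def relative_value_iteration(omega, p, tau, E, eta, iters=5000):
    """Solve the single-project omega-subsidy problem numerically.

    Returns the average reward R and the greedy action in each state
    (0 = passive, 1 = active), with the normalisation f(0) = 0.
    """
    f = [0.0] * (tau + 1)
    gain, q_values = 0.0, []
    for _ in range(iters):
        q_values = []
        for i in range(tau + 1):
            nxt = min(i + 1, tau)
            base = -1.0 if i == tau else 0.0       # -1{i = tau}
            passive = base + omega + f[nxt]
            active = base - eta * E + p * f[0] + (1 - p) * f[nxt]
            q_values.append((passive, active))
        best = [max(pair) for pair in q_values]
        gain = best[0]                             # equals R because f(0) = 0
        f = [v - gain for v in best]
    actions = [1 if active > passive else 0 for passive, active in q_values]
    return gain, actions


gain, actions = relative_value_iteration(omega=0.0, p=0.6, tau=10, E=2.0, eta=0.1)
```

For these parameters the greedy actions form a threshold rule (passive in low states, active in high states), and `gain` agrees with `closed_form_reward` evaluated at the corresponding threshold.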

Theorem 6. For the relaxed problem (9)-(10) and its dual
$d(\omega)$, the following results hold:

1) The dual function $d(\omega)$ satisfies

$$
d(\omega) = \sum_{n=1}^{N} R_n(\omega) - \omega (1 - \alpha) N.
$$

2) Strong duality holds, i.e., the optimal average reward for
the relaxed problem, denoted $R_{\text{rel}}$, satisfies

$$
R_{\text{rel}} = \min_{\omega \ge 0} d(\omega).
$$

3) In addition, $d(\omega)$ is a convex and piecewise linear function
of $\omega$. Thus, the value of $R_{\text{rel}}$ can be easily obtained.

Proof: For 1), it follows from Lemma 5 that for the
$\omega$-subsidy problem associated with each project n, there is
at least one stationary optimal policy, and under this policy,
the optimality equation holds true. Thus, under the optimal
policy, the limit of the time average reward exists (which is
closely related to the optimality equation, see 0). That is,
the $\limsup_{T \to +\infty}$ in (12) can be replaced by $\lim_{T \to +\infty}$. As
a result, all the “less than or equal to” signs in (11) can be replaced
by equality signs. This proves the first statement.

For 2), the strong duality is proved by showing complementary
slackness. The details are omitted due to space constraints.

For 3), it follows from equation (14) that each $R_n(\omega)$ is
a piecewise linear function. To prove convexity of $R_n(\omega)$,
note that the reward earned by any policy is a linear function
of $\omega$, and the supremum of linear functions is convex. Thus,
by statement 1), $d(\omega)$ is also convex and piecewise linear. In
addition, since each $R_n(\omega)$ can be easily derived from Lemma
5, the expression of $d(\omega)$ easily follows. Thus, $R_{\text{rel}}$, which
is the minimum value of this known, convex, and piecewise
linear function $d(\omega)$, can be easily obtained. ■
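Statement 3) suggests a simple numerical recipe: evaluate the dual only at its breakpoints. The Python sketch below works per client (i.e., with $d(\omega)/N$) for a population split into classes, and assumes that the breakpoints of each $R_n(\omega)$ are the values $W_n(\theta)$ in the form $p_n(\theta+1)(1-p_n)^{\tau_n-\theta-1} - \eta E_n$. The function names and the `(gamma, p, tau, E)` tuple layout are hypothetical.

```python
def subsidy_value(omega, p, tau, E, eta):
    """R_n(omega): best average reward over the threshold policies of Lemma 5."""
    rewards = [
        (p * th * omega - eta * E - (1 - p) ** (tau - th)) / (1 + th * p)
        for th in range(tau + 1)
    ]
    rewards.append(omega - 1.0)  # always passive: stuck at tau, earns omega - 1
    return max(rewards)


def dual_per_client(omega, classes, alpha, eta):
    """d(omega) / N for classes given as (gamma, p, tau, E) tuples."""
    total = sum(g * subsidy_value(omega, p, tau, E, eta) for g, p, tau, E in classes)
    return total - omega * (1 - alpha)


def relaxed_optimum_per_client(classes, alpha, eta):
    """Minimize the convex piecewise-linear dual over its breakpoints, omega >= 0."""
    candidates = {0.0}
    for _, p, tau, E in classes:
        for th in range(tau):
            w = p * (th + 1) * (1 - p) ** (tau - th - 1) - eta * E
            candidates.add(max(w, 0.0))  # breakpoints below 0 are clipped to 0
    best = min(candidates, key=lambda w: dual_per_client(w, classes, alpha, eta))
    return dual_per_client(best, classes, alpha, eta), best
```

Because a convex piecewise-linear function attains its minimum at a breakpoint or at the boundary of the feasible interval, scanning this finite candidate set is sufficient, and no numerical optimizer is needed.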

V. The Large Client Population Asymptotic 
Optimality of the Index Policy 

The Whittle index (see 0) $W_n(i)$ of project n at state i
is defined as the value of the subsidy that makes the passive
and active actions equally attractive for the $\omega$-subsidy problem
associated with project n in state i. Indexability of the n-th
project is defined as follows. Let $B_n(\omega)$ be the set of
states for which project n would be passive under an optimal
policy for the corresponding $\omega$-subsidy problem. Project n is
indexable if, as $\omega$ increases from $-\infty$ to $+\infty$, the set $B_n(\omega)$
increases monotonically from $\emptyset$ to the whole state space of
project n. The bandit problem is indexable if each of the
constituent projects is indexable.


Lemma 7. The following are true:

1) The Whittle index $W_n(i)$ of project n at state i is

$$
W_n(i) = p_n (i + 1) (1 - p_n)^{\tau_n - i - 1} - \eta E_n,
$$

when $i = 0, 1, \ldots, \tau_n - 1$; while $W_n(\tau_n) = W_n(\tau_n - 1)$.

2) The stringent-constraint scheduling problem (7)-(8) is
indexable.

3) For each project n, the transition rates of its states in the
associated $\omega$-subsidy problem form a unichain (there is
a state $j \in \{0, 1, \ldots, \tau_n\}$ such that there is a path from
any state $i \in \{0, 1, \ldots, \tau_n\}$ to state j), regardless of the
policy employed.

Proof: Statements 1) and 2) directly follow from Lemma 5
and the definitions of the Whittle index and indexability. To prove
statement 3), note that since $p_n < 1$, there is a positive
probability that there is no packet delivery for $\tau_n$ successive
time slots, regardless of the policy employed. Thus, from any
state $i \in \{0, 1, \ldots, \tau_n\}$, there is a path to the state $\tau_n$. ■
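As a worked example of statement 1), assume the index expression $W_n(i) = p_n (i+1)(1-p_n)^{\tau_n - i - 1} - \eta E_n$ and take the parameter values $p_n = 0.8$, $\tau_n = 5$, $E_n = 3$, $\eta = 0.1$ (chosen here purely for illustration):

```latex
\begin{aligned}
W_n(0) &= 0.8 \cdot 1 \cdot 0.2^{4} - 0.3 = -0.29872, \\
W_n(1) &= 0.8 \cdot 2 \cdot 0.2^{3} - 0.3 = -0.2872, \\
W_n(2) &= 0.8 \cdot 3 \cdot 0.2^{2} - 0.3 = -0.204, \\
W_n(3) &= 0.8 \cdot 4 \cdot 0.2^{1} - 0.3 = 0.34, \\
W_n(4) &= 0.8 \cdot 5 \cdot 0.2^{0} - 0.3 = 3.7, \\
W_n(5) &= W_n(4) = 3.7.
\end{aligned}
```

The index is increasing in the state, and it is negative in states 0 to 2: for such a client a zero subsidy already makes passivity preferable there, so transmitting only becomes worthwhile once three or more slots have elapsed since the last delivery.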

As a result, the Whittle indices induce a well-defined order 
on the state values of each project. This gives the following 
heuristic policy. 

Whittle Index Policy. At the beginning of each time slot
t, client n is scheduled if its Whittle index $W_n(Y_n(t))$ is
positive and, moreover, is among the top $\alpha N$ index values of
all clients in that slot. Ties are broken arbitrarily, with no more
than $\alpha N$ clients simultaneously scheduled.
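The scheduling rule can be sketched in a few lines of Python. The code assumes the index expression $W_n(i) = p_n(i+1)(1-p_n)^{\tau_n-i-1} - \eta E_n$ with $W_n(\tau_n) = W_n(\tau_n - 1)$, breaks ties by client order (one arbitrary choice among many), and uses the hypothetical name `whittle_schedule`.

```python
def whittle_schedule(Y, p, tau, E, eta, L):
    """Return the clients scheduled in one slot under the Whittle Index Policy.

    Y[n] is the current state of client n; at most L clients are selected,
    and only clients with a strictly positive index are eligible.
    """
    def index(n):
        i = min(Y[n], tau[n] - 1)  # W_n(tau_n) = W_n(tau_n - 1)
        return p[n] * (i + 1) * (1 - p[n]) ** (tau[n] - i - 1) - eta * E[n]

    ranked = sorted(range(len(Y)), key=index, reverse=True)
    return [n for n in ranked[:L] if index(n) > 0]
```

Each slot costs one index evaluation per client plus a sort, so the per-slot complexity is O(N log N), in contrast to the exponential state space of the DP solution.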

Now, we show the asymptotic optimality property of the
Whittle Index Policy. Classify the N projects into K classes
such that the projects in the same class have the same values of
$p_n$, $\tau_n$ and $E_n$, while projects not in the same class differ in at
least one of these parameters. For each class $k \in \{1, \ldots, K\}$,
denote by $\gamma_k$ the proportion of total projects that it contains;
that is, there are $\gamma_k N$ projects in class k.


Assumption 1. Construct the fluid model of the restless
bandit problem (7)-(8) as in 0 and [10], and denote the
fluid process as $z(t)$. We assume that, under the Whittle Index
Policy, $z(t)$ satisfies the global attractor property. That is, there
exists $z^*$ such that from any initial point $z(0)$, the process $z(t)$
converges to the point $z^*$, under the Whittle Index Policy.


This assumption is not restrictive because of the following:
First note that the MDP-2 itself also satisfies the unichain
property. Then, under the Whittle Index policy, MDP-2 also
forms a unichain. As a result, this N-client bandit problem
has a single recurrent class, and has a global attractor. Thus,
it is not restrictive to assume that its fluid model also satisfies
the global attractor property.
Fig. 1. The time average cost per client vs. the total number of clients for
the optimal policy under the relaxed constraint and the Whittle Index policy.
(The parameters are $\alpha = 0.3$, $\eta = 0.1$, with K = 2 classes of
projects, and $\gamma_1 = 0.5$, $\gamma_2 = 0.5$ proportion of projects in each class. For
each client n in the first class, $p_n = 0.6$, $\tau_n = 10$, $E_n = 2$; while for each
client n in the second class, $p_n = 0.8$, $\tau_n = 5$, $E_n = 3$.)


Fig. 2. The time average inter-delivery penalty per client vs. the time average
energy consumption per client under the Whittle Index Policy for different
values of the energy-efficiency parameter $\eta$. (The parameters are N =
100, $\alpha = 0.3$, with K = 2 classes of projects, and $\gamma_1 = 0.5$, $\gamma_2 = 0.5$
proportion of projects in each class. For each client n in the first class, $p_n =
0.6$, $\tau_n = 10$, $E_n = 2$; while for each client n in the second class, $p_n = 0.8$,
$\tau_n = 5$, $E_n = 3$.)


Theorem 8. When Assumption 1 holds, as the number N
of clients increases to infinity, $R_{\text{ind}}/N \to R_{\text{rel}}/N$, where $R_{\text{ind}}$
and $R_{\text{rel}}$ are the system rewards under the Whittle Index policy
and the optimal relaxed policy, respectively. (Here, the fraction
of active bandits $\alpha$ and the proportion of each bandit class $\gamma_k$
remain the same when N increases. In addition, the client
number N is such that all $\gamma_k N$ are integers.) Thus, the Whittle
Index policy is asymptotically optimal.

Proof: By Assumption 1 and Lemma 7, $R_{\text{ind}}/N \to
R_{\text{rel}}/N$ follows directly from the result in ijT^. Note that $R_{\text{rel}}$ is an
upper bound for the stringent-constraint problem; thus, the
asymptotic optimality holds. ■

VI. Simulation Results 

We now present the results of simulations of the Whittle Index
policy with respect to its average cost per client. The numerical
results for the relaxed-constraint problem (9)-(10), which are
derived from Theorem 6 and Lemma 5, are also employed to
provide a bound on the stringent-constraint problem.

Fig. 1 illustrates the average cost per client under the relaxed
optimal policy and the Whittle Index policy for different total
numbers of clients. It can be seen that when the total number of
clients increases, the gap between the relaxed optimal cost and
the cost under the Whittle Index policy shrinks to zero. Since
the optimal cost of the relaxed-constraint problem serves as a
lower bound on the cost in the stringent-constraint problem,
this means the Whittle Index policy approaches the optimal
cost as the total number of clients increases, i.e., the Whittle
Index policy is asymptotically optimal.

Fig. 2 illustrates the average inter-delivery penalty per
client versus the average energy consumption per client under
the Whittle Index policy for different values of the
energy-efficiency parameter $\eta$. As $\eta$ increases, the average energy
consumption decreases, while the average inter-delivery penalty
increases. Thus, there is a tradeoff between energy efficiency
and inter-delivery regularity. By changing $\eta$, we can balance
these two important considerations.
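An experiment of this kind can be sketched as follows. The Python code below simulates MDP-2 under the Whittle Index policy and reports the time-average penalty and energy per client; it is an illustrative sketch, not the authors' simulation code. The function names, the seed handling, the tie-breaking by client order and the assumed index expression $p_n(i+1)(1-p_n)^{\tau_n-i-1} - \eta E_n$ are all assumptions.

```python
import random


def whittle_index(i, p, tau, E, eta):
    """Assumed index expression, with W(tau) = W(tau - 1)."""
    i = min(i, tau - 1)
    return p * (i + 1) * (1 - p) ** (tau - i - 1) - eta * E


def simulate(N, alpha, eta, classes, horizon, seed=0):
    """Simulate the Whittle Index policy on MDP-2.

    classes: list of (gamma, p, tau, E) tuples, gamma * N assumed integer.
    Returns (average penalty, average energy) per client per slot.
    """
    rng = random.Random(seed)
    params = []
    for gamma, p, tau, E in classes:
        params += [(p, tau, E)] * round(gamma * N)
    L = round(alpha * N)                      # clients allowed per slot
    Y = [0] * len(params)                     # all clients start freshly served
    penalty = energy = 0.0
    for _ in range(horizon):
        # inter-delivery penalty: one unit for each client sitting at its threshold
        penalty += sum(1 for n, y in enumerate(Y) if y == params[n][1])
        idx = [whittle_index(Y[n], *params[n], eta) for n in range(len(Y))]
        ranked = sorted(range(len(Y)), key=lambda n: idx[n], reverse=True)
        chosen = [n for n in ranked[:L] if idx[n] > 0]
        delivered = set()
        for n in chosen:
            p, tau, E = params[n]
            energy += E                       # each attempt consumes E_n
            if rng.random() < p:
                delivered.add(n)
        Y = [0 if n in delivered else min(Y[n] + 1, params[n][1])
             for n in range(len(Y))]
    scale = horizon * len(Y)
    return penalty / scale, energy / scale
```

The average cost per client is then `penalty + eta * energy`; sweeping `N` at fixed `eta`, or `eta` at fixed `N`, gives the kind of curves described for Fig. 1 and Fig. 2, respectively.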